Learning in Restless Bandits Under Exogenous Global Markov Process
Authors
Abstract
We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics, where an exogenous global Markov process governs the reward distribution of each arm. Under each global state, the reward process of each arm evolves according to an unknown Markovian rule, which is non-identical among different arms. At each time, a player chooses one out of $N$ arms to play, and receives a random reward drawn from a finite set of reward states. The arms are restless, that is, their local state evolves regardless of the player's actions. Motivated by recent studies on related RMAB settings, the regret is defined as the reward loss with respect to a player that knows the dynamics of the problem and plays at each time $t$ the arm that maximizes the expected immediate value. The objective is to develop an arm-selection policy that minimizes the regret. To that end, we develop the Learning under Exogenous Markov Process (LEMP) algorithm. We analyze LEMP theoretically and establish a finite-sample bound showing that it achieves a regret of logarithmic order with time. We further analyze LEMP numerically and present simulation results that support the theoretical findings and demonstrate that LEMP significantly outperforms alternative algorithms.
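The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's LEMP algorithm; it only instantiates the environment model under assumed toy dynamics: a global Markov chain switches between two states, and under each global state every arm's local reward state evolves by its own Markov rule, restlessly (all arms evolve whether or not they are played). All transition matrices, reward values, and the placeholder uniform policy are illustrative assumptions, not values from the paper.

```python
import random

random.seed(0)

N_ARMS = 3
T = 1000

# Global process: 2 states, row-stochastic transition matrix (assumed values).
P_GLOBAL = [[0.9, 0.1],
            [0.2, 0.8]]

# Per-arm local transition matrices, one 2x2 matrix per global state
# (assumed values; in the paper these are unknown and non-identical across arms).
P_ARM = [
    [[[0.7, 0.3], [0.4, 0.6]],   # local dynamics under global state 0
     [[0.5, 0.5], [0.1, 0.9]]]   # local dynamics under global state 1
    for _ in range(N_ARMS)
]
REWARD = [0.0, 1.0]  # finite reward set: local state 0 pays 0, state 1 pays 1

def step(row_probs):
    """Sample the next binary state from one row of a transition matrix."""
    return 0 if random.random() < row_probs[0] else 1

g = 0                    # current global state
local = [0] * N_ARMS     # current local state of each arm
total_reward = 0.0

for t in range(T):
    arm = random.randrange(N_ARMS)   # placeholder uniform policy, not LEMP
    total_reward += REWARD[local[arm]]
    # Restless dynamics: every arm's local state evolves regardless of the play.
    for i in range(N_ARMS):
        local[i] = step(P_ARM[i][g][local[i]])
    # The exogenous global state evolves on its own.
    g = step(P_GLOBAL[g])

print(f"average reward of uniform policy over {T} steps: {total_reward / T:.3f}")
```

A learning policy such as LEMP would replace the uniform arm choice with one driven by estimates of the unknown per-arm dynamics; the oracle player used in the regret definition would instead pick, at each time, the arm with the highest expected immediate reward given the current states.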
Similar Resources
Regret Bounds for Restless Markov Bandits
We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner’s actions. We suggest an algorithm that after T steps achieves Õ( √ T ) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we sho...
Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret
In this paper we consider the problem of learning the optimal dynamic policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which when played yields a non-negative reward. There is a player who sequentially selects one of the arms at each time step. The goal of the player is to maximize its undiscounted reward over a...
Competing Bandits: Learning Under Competition
Most modern systems strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We initiate a study of the interplay between exploration and competition—how such systems balance the exploration for learning and the competition for users. Here the users play three distinct roles: they are customers...
Opportunistic Scheduling as Restless Bandits
In this paper we consider energy efficient scheduling in a multiuser setting where each user has a finite sized queue and there is a cost associated with holding packets (jobs) in each queue (modeling the delay constraints). The packets of each user need to be sent over a common channel. The channel qualities seen by the users are time-varying and differ across users; also, the cost incurred, i...
Modeling Human Performance in Restless Bandits with Particle Filters
Bandit problems provide an interesting and widely-used setting for the study of sequential decision-making. In their most basic form, bandit problems require people to choose repeatedly between a small number of alternatives, each of which has an unknown rate of providing reward. We investigate restless bandit problems, where the distributions of reward rates for the alternatives change over ti...
Journal
Journal title: IEEE Transactions on Signal Processing
Year: 2022
ISSN: 1053-587X, 1941-0476
DOI: https://doi.org/10.1109/tsp.2022.3224790